Abstract:


JTL A 


Automated Essay Scoring (AES) is defined as the computer technology that evaluates and
scores written prose (Shermis & Barrera, 2002; Shermis & Burstein, 2003; Shermis,
Raymat, & Barrera, 2003). AES systems are mainly used to overcome time, cost, reliability,
and generalizability issues in writing assessment (Bereiter, 2003; Burstein, 2003; Chung
& O’Neil, 1997; Hamp-Lyons, 2001; Myers, 2003; Page, 2003; Rudner & Gagne, 2001;
Rudner & Liang, 2002; Sireci & Rizavi, 1999). AES continues to attract the attention
of public schools, universities, testing companies, researchers, and educators (Burstein,
Kukich, Wolff, Lu, & Chodorow, 1998; Shermis & Burstein, 2003; Sireci & Rizavi, 1999).
The main purpose of this article is to provide an overview of current approaches to AES.
The article describes the most widely used AES systems, including Project Essay Grader™
(PEG), Intelligent Essay Assessor™ (IEA), E-rater® and Criterion™, IntelliMetric™ and MY
Access!®, and Bayesian Essay Test Scoring System™ (BETSY). It also discusses the main
characteristics of these systems and current issues regarding their use in both low-stakes
assessment (in classrooms) and high-stakes assessment (as standardized tests).
Introduction

Automated Essay Scoring (AES) is defined as the computer technology 
that evaluates and scores written prose (Shermis & Barrera, 2002;
Shermis & Burstein, 2003; Shermis, Raymat, & Barrera, 2003). AES sys- 
tems are developed to assist teachers in low-stakes classroom assessment 
as well as testing companies and states in large-scale high-stakes assess- 
ment. They are mainly used to help overcome time, cost, reliability, and 
generalizability issues in writing assessment (Bereiter, 2003; Burstein, 
2003; Chung & O’Neil, 1997; Hamp-Lyons, 2001; Myers, 2003; Page, 2003; 
Rudner & Gagne, 2001; Rudner & Liang, 2002; Sireci & Rizavi, 1999). 

A number of studies have been conducted to assess the accuracy and 
reliability of the AES systems with respect to writing assessment. The 
results of several AES studies reported high agreement rates between 
AES systems and human raters (Attali, 2004; Burstein & Chodorow, 1999; 
Landauer, Laham, & Foltz, 2003; Landauer, Laham, Rehder, & Schreiner, 
1997; Nichols, 2004; Page, 2003; Vantage Learning, 2000a, 2000b, 2001b, 
2002, 2003a, 2003b). 

AES systems have been criticized for their lack of human interaction
(Hamp-Lyons, 2001), their vulnerability to cheating (Chung & O’Neil, 1997;
Kukich, 2000; Rudner & Gagne, 2001), and their need for a large corpus of
sample text to train the system (Chung & O’Neil, 1997). Despite these
weaknesses, AES continues to attract the attention of public schools,
universities, testing companies, researchers, and educators (Burstein, Kukich,
Wolff, Lu, & Chodorow, 1998; Shermis & Burstein, 2003; Sireci & Rizavi, 1999).

The purpose of this article is to provide an overview of current 
approaches to automated essay scoring. The next section will describe the 
most widely used AES systems: Project Essay Grader™ (PEG), Intelligent 
Essay Assessor™ (IEA), E-rater® and Criterion™, IntelliMetric™ and 
MY Access!®, and Bayesian Essay Test Scoring System™ (BETSY). The 
final section will summarize the main characteristics of AES systems and 
will discuss current issues regarding the use of these systems both in 
low-stakes assessment (in classrooms) and high-stakes assessment (as 
standardized tests). 
Automated Essay Scoring Systems

Project Essay Grader™ (PEG) 

Project Essay Grader™ (PEG) was developed by Ellis Page in 1966 upon 
the request of the College Board, which wanted to make the large-scale 
essay scoring process more practical and effective (Rudner & Gagne, 2001; 
Page, 2003). PEG™ uses correlation to predict the intrinsic quality of the 
essays (Chung & O’Neil, 1997; Kukich, 2000; Rudner & Gagne, 2001). 

Page and his colleagues use the terms trins and proxes to explain
the way PEG™ generates a score. Trins are intrinsic variables such as
fluency, diction, grammar, and punctuation, whereas proxes are
approximations (correlates) of those intrinsic variables. Thus, proxes are
actual counts taken from an essay (e.g., the amount of vocabulary, a prox,
is correlated with fluency, a trin; Page, 1994).

The scoring methodology PEG™ employs is simple. The system has a
training stage and a scoring stage. In the training stage, PEG™ is trained
on a sample of essays. In the scoring stage, proxy variables (proxes) are
determined for each essay and entered into the prediction equation. A score
is then assigned using the beta weights (coefficients) estimated in the
training stage (Chung & O’Neil, 1997). PEG™ needs 100 to 400 sample essays
for training purposes (BETSY, n.d.).
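
The two stages can be illustrated with a minimal sketch. It assumes a single hypothetical prox (essay word count) and a one-variable least-squares fit, which is a simplified stand-in for a prediction equation that combines many proxes. The function names and toy data are invented for illustration and are not part of PEG™.

```python
def word_count(essay):
    """Hypothetical prox: number of whitespace-separated words."""
    return len(essay.split())


def train(essays, human_scores, prox=word_count):
    """Training stage: fit score = intercept + beta * prox by least squares."""
    xs = [prox(e) for e in essays]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(human_scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, human_scores))
    beta = sxy / sxx  # beta weight (coefficient) for the prox
    intercept = mean_y - beta * mean_x
    return intercept, beta


def score(essay, model, prox=word_count):
    """Scoring stage: enter the essay's prox into the prediction equation."""
    intercept, beta = model
    return intercept + beta * prox(essay)


# Toy training sample (invented for illustration only).
training_essays = ["a b c", "a b c d e", "a b c d e f g"]
training_scores = [1.0, 2.0, 3.0]
model = train(training_essays, training_scores)
predicted = score("a b c d e f g h i", model)
```

Because the only prox in this sketch is length, a longer essay always receives a higher predicted score.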

One of the strengths of PEG™ is that the predicted scores are comparable
to those of human raters. Furthermore, the system can computationally
track the writing errors made by the users (Chung & O’Neil, 1997).
However, PEG™ has been criticized for ignoring the semantic aspect of
essays and focusing more on the surface structures (Chung & O’Neil,
1997; Kukich, 2000). Because it fails to detect the content-related features
of an essay (organization, style, etc.), the system does not provide
instructional feedback to students. An early version was also found to be
weak in terms of scoring accuracy. For example, since PEG™ used indirect
measures of writing skill, it was possible to “trick” the system by writing
longer essays (Kukich, 2000). PEG™ went through changes in the 1990s
(Kukich, 2000), and several aspects of PEG™ were modified, including not
only several parsers and various dictionaries, but also special collections
and classification schemes (Page, 2003; Shermis & Barrera, 2002).
Intelligent Essay Assessor™ (IEA)

Another AES system, Intelligent Essay Assessor™ (IEA), analyzes
and scores an essay using a semantic text-analysis method called Latent
Semantic Analysis (LSA; Lemaire & Dessus, 2001), an approach created by
psychologist Thomas Landauer with the assistance of Peter Foltz and
Darrell Laham (Murray, 1998). IEA™ is produced by Pearson Knowledge
Technologies (PKT; Psotka & Streeter, 2004, p. 2; PKT, n.d.). More
detailed descriptions of LSA and IEA™ are provided below.

Latent Semantic Analysis 

Latent Semantic Analysis (LSA) is defined as “a statistical model of 
word usage that permits comparisons of the semantic similarity between 
pieces of textual information” (Foltz, 1996, p. 2). LSA first processes a 
corpus of machine-readable language and then represents the words that 
are included in a sentence, paragraph, or essay through statistical
computations (Landauer, Laham, & Foltz, 1998). LSA measures of similarity
are considered highly correlated with human meaning similarities among
words and texts. Moreover, LSA successfully imitates human word selection
and category judgments (Landauer, Laham, & Foltz, 2003). The underlying
idea is that the meaning of a passage is very much dependent on its words 
and changing even only one word can result in meaning differences in the 
passage. On the other hand, two passages with different words might have 
a very similar meaning (Landauer et al., 2003). The underlying idea can be 
summarized as: 

“meaning of word₁ + meaning of word₂ + ... + meaning of wordₖ =
meaning of passage” (Landauer et al., 2003, p. 88).

The educational applications of LSA include picking the most suitable 
text for students with different levels of background knowledge, auto- 
matic scoring of essay contents, and assisting students in summarizing 
texts successfully (Landauer et al., 1998). In order to evaluate the overall 
quality of an essay, LSA needs to be trained on domain-representative 
texts (texts that best represent the writing prompt). Then the essay needs 
to be characterized by LSA vectors (a mathematical representation of the 
essay). Finally, the conceptual relevance and the content of the essay are 
compared to other texts (Foltz, Laham, & Landauer, 1999; Landauer et al.,
1998). 

In the LSA-based approach, the text is represented as a matrix. Each
row in the matrix represents a unique word, while each column represents
a context (e.g., a sentence or paragraph). Each cell contains the frequency
of the word in that context. Each cell frequency is then weighted by a
function that reflects not only the importance of the word in that context
but also the degree to which the word type carries information in the
domain discourse (Landauer et al., 1998). The semantics of a word are
determined from all the contexts in which the word occurs. The number of
occurrences of each word in each context defines the semantic space. For
example, 2000 words and 300 paragraphs yield a 2000 × 300 matrix. Here,
each word is represented by a 300-dimensional vector, while each paragraph
is represented by a 2000-dimensional vector. By reducing these dimensions,
LSA induces semantic similarities between words. This reduction is critical
since it permits the representation of word meanings through the contexts
in which they occur. The number of dimensions is also crucial. If the
number is too small, much of the information will be lost. If the number
is too large, few dependencies will be drawn between vectors. According to
this method, the semantic information is determined only through the
co-occurrence of words in a large corpus of texts (Lemaire & Dessus, 2001).
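
The word-by-context representation can be sketched in a few lines. The example below builds the count matrix for three toy contexts and applies a log-entropy weighting; that particular weighting scheme, the function names, and the sample sentences are assumptions made for illustration. The dimension-reduction step (a truncated singular value decomposition in LSA) is omitted here.

```python
import math
from collections import Counter


def word_by_context_matrix(contexts):
    """Rows are unique words, columns are contexts, cells are raw counts."""
    counts = [Counter(text.lower().split()) for text in contexts]
    vocab = sorted(set().union(*counts))
    matrix = [[c[word] for c in counts] for word in vocab]
    return vocab, matrix


def log_entropy_weight(matrix):
    """Weight each cell by log(count + 1) times a global entropy weight.

    The global weight is 1 for a word concentrated in one context and 0 for
    a word spread evenly over all contexts. This scheme is an assumption;
    the text above only says that cell frequencies are weighted.
    """
    n_contexts = len(matrix[0])  # must be at least 2
    weighted = []
    for row in matrix:
        total = sum(row)
        entropy = -sum((f / total) * math.log(f / total) for f in row if f)
        global_weight = 1.0 - entropy / math.log(n_contexts)
        weighted.append([math.log(f + 1) * global_weight for f in row])
    return weighted


# Toy contexts (invented for illustration only).
contexts = [
    "the heart pumps blood",
    "blood carries oxygen",
    "the lungs absorb oxygen",
]
vocab, matrix = word_by_context_matrix(contexts)
weighted = log_entropy_weight(matrix)

word_vector = matrix[vocab.index("blood")]   # one entry per context
context_vector = [row[0] for row in matrix]  # one entry per word
```

With 8 distinct words and 3 contexts, each word vector has 3 entries and each context vector has 8, mirroring the word and paragraph vectors described above.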

Intelligent Essay Assessor™ (IEA) 

Unlike other AES systems, IEA™ focuses more on content-related
features than on form-related ones; however, this does not mean that
IEA™ provides no feedback on formal aspects (e.g., grammar and
punctuation) of an essay. In other words, even though the system uses an
LSA-based approach to evaluate mainly the quality of the content of an
essay, it also includes scoring and feedback on grammar, style, and
mechanics (Landauer, Laham, & Foltz, 2000; Landauer, Laham, & Foltz,
2003; Streeter, Psotka, Laham, & MacCuish, 2004). Figure 1 shows an
example of the feedback feature provided by IEA™ (PKT, n.d.).
Figure 1: Sample Feedback Screen of IEA™¹
Landauer et al. (2003) claim that IEA™ can successfully analyze not
only content-based essays, but also creative narratives. The system needs
to be trained on a set of domain-representative texts in order to measure
the overall quality of an essay. For example, a biology book can be used
to evaluate a biology essay. IEA™ uses three sources to analyze an essay:
“a) pre-scored essays of other students, b) expert model essays and
knowledge source materials, c) internal comparison of an unscored set of
essays” (Landauer et al., 2003, p. 90). This approach allows IEA™ to
compare each essay with similar texts in terms of content quality
(Landauer et al., 2000; Landauer et al., 2003; Streeter et al., 2004).
First, IEA™ compares the content of a student’s essay with that of other
essays on the same topic scored by human raters to determine how closely
they match (Landauer et al., 2000; Rudner & Gagne, 2001; Streeter et al.,
2004). It then predicts the overall score by adding “corpus-statistical
writing-style” and mechanics measures (Landauer et al., 2000, p. 28). It
also spots plagiarism and provides feedback (Landauer et al., 2000;
Landauer et al., 2003).

As part of the usual procedure of IEA™, each essay is compared to every
other one in a set. Essays that are extremely similar to each other are
examined by LSA. Even when synonyms are substituted, passages are
paraphrased, or sentences are rearranged, two such essays will remain
similar under LSA (Landauer et al., 2003). Detecting plagiarism is an
essential feature since this type of academic dishonesty is quite hard for
human raters to detect, particularly when grading a large number of essays
(Shermis, Raymat, & Barrera, 2003). The structure of IEA™ is presented in
Figure 2 (Landauer et al., 2003, p. 90).

Figure 2: The Intelligent Essay Assessor™ Architecture 
Landauer et al. (2000) point out the basic technical difference between
IEA™ and other AES systems as follows: 

Other systems work primarily by finding essay features they can 
count and correlate with ratings human graders assigned. They 
determine a formula for choosing and combining the variables that 
produces the best results on the training data. They then apply this 
formula to every to-be-scored essay. What principally distinguishes 
IEA is its LSA-based direct use of evaluations by human experts of 
essays that are very similar in semantic content. This method, called 
vicarious human scoring, lets the implicit criteria for each individual 
essay differ (p. 28).
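
The idea of scoring an essay from human ratings of semantically similar essays can be sketched as follows. This is a hedged illustration, not the IEA™ algorithm: it assumes each essay has already been reduced to a numeric vector (for example, an LSA vector), and the choice of `k`, the similarity-weighted mean, and the name `vicarious_score` are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def vicarious_score(essay_vec, scored_vecs, k=2):
    """Similarity-weighted mean of the human scores of the k closest essays.

    scored_vecs is a list of (vector, human_score) pairs.
    """
    ranked = sorted(
        scored_vecs, key=lambda pair: cosine(essay_vec, pair[0]), reverse=True
    )[:k]
    weights = [max(cosine(essay_vec, vec), 0.0) for vec, _ in ranked]
    if sum(weights) == 0:
        # No usable similarity: fall back to the plain mean of the k scores.
        return sum(s for _, s in ranked) / len(ranked)
    return sum(w * s for w, (_, s) in zip(weights, ranked)) / sum(weights)


# Toy two-dimensional vectors and human scores (invented for illustration).
scored = [([1.0, 0.0], 2.0), ([0.0, 1.0], 6.0), ([1.0, 1.0], 4.0)]
estimate = vicarious_score([1.0, 0.0], scored, k=2)
```

Because the estimate draws only on the nearest pre-scored essays, the essays that determine the score differ from one test essay to the next.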

The producers of IEA™, Pearson Knowledge Technologies (PKT), report
that the system needs fewer pre-scored essays for training. Unlike other
AES systems, which require 300-500 training essays per prompt, IEA™
requires only 100 pre-scored essays (PKT, n.d.; Landauer et al., 2003).
PKT states that the system does not evaluate creativity and reflective
thinking. It does, however, assess “expository essays on factual topics”
such as a description of a psychological theory or the function of the
heart (Murray, 1998). Future plans for IEA™ include moving from global
assessment features, such as flow and coherence, to more specific ones,
such as voice and audience (Landauer et al., 2003).

E-rater® and Criterion™ 

The electronic essay rater (e-rater®) was developed by the Educational 
Testing Service (ETS) to evaluate the quality of an essay by identifying 
linguistic features in the text (Burstein, 2003; Burstein & Marcu, 2000).
E-rater® uses natural-language processing (NLP) techniques, which 
identify specific lexical and syntactical cues in a text to analyze essays 
(Burstein, 2003; Kukich, 2000). A detailed description of NLP and infor- 
mation regarding the structure and functions of e-rater and Criterion™ are 
provided below. 

Artificial Intelligence (AI) and Natural Language Processing (NLP)

Artificial intelligence (AI) is defined as the science of making
intelligent machines. AI has several applications, including game playing,
speech recognition, natural language understanding, computer vision,
and so on.¹

NLP is considered to be one of the most challenging areas of AI. The 
research in NLP comprises a variety of fields including corpus-based 
methods, discourse methods, formal models, machine translation, natural 
language generation, and spoken-language understanding (Salem, 2000). 
Several methods have been used in NLP. Earlier methods (e.g., rationalist
methods) required manual encoding of linguistic knowledge, which has
proven to be difficult due to the complex nature of human language. More
recent methods (e.g., empirical methods), however, employ techniques that
automatically extract linguistic knowledge from large text corpora. In
other words, empirical methods employ statistical or machine learning
techniques to train the system on large amounts of authentic language
data (Brill & Mooney, 1997).

NLP is a complex task because it involves several levels of processing
and subtasks. Its language tasks include speech recognition, syntactic
analysis, semantic analysis, discourse analysis, information extraction,
and machine translation. Speech recognition focuses on mapping a
continuous speech signal into a sequence of known words. Syntactic
analysis determines the ways words are grouped into constituents such as
noun and verb phrases. Semantic analysis involves mapping a sentence to a
type of meaning representation such as a logical expression. Discourse
analysis focuses on how context impacts sentence interpretation, whereas
information extraction locates specific pieces of data in a natural
language document. Finally, the task of machine translation is to
translate text from one natural language to another, such as English to
German or vice versa (Brill & Mooney, 1997).

E-rater® 

E-rater® was initially used by ETS for operational scoring of the
Graduate Management Admission Test Analytical Writing Assessment
(GMAT AWA; Burstein, 2003; Burstein & Chodorow, 1999; Burstein &
Marcu, 2000) and had been employed for scoring the AWA since February
1999 (Burstein, 2003). However, as of January 2006, ACT, Inc. started
scoring GMAT essays using IntelliMetric™, Vantage Learning’s automated
essay scoring engine (Rudner, Garcia, & Welch, 2005). Before e-rater® was
introduced, the GMAT AWA was scored by two human raters on a 6-point
holistic scale, with 6 being the highest score and 1 the lowest. If the two
raters differed by more than 1 point, a third rater was called in for
resolution (Burstein, 2003; Burstein & Chodorow, 1999). With e-rater®, the
test-taker’s final score was determined by e-rater® and one human rater.
Similar to the prior practice with human raters, if e-rater® and the human
rater differed by more than 1 point, a second human rater was included
(Burstein, 2003). To date, ACT, Inc. continues to use the IntelliMetric™
scoring procedure (Rudner, Garcia, & Welch, 2005).
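
The discrepancy rule described above is simple enough to state as code. The sketch below only checks whether a second human rater is needed; how the final score is combined is not specified in the text, so no combination rule is shown. The function name and the `max_gap` parameter are hypothetical.

```python
SCALE = range(1, 7)  # 6-point holistic scale: 1 is lowest, 6 is highest


def needs_second_human(e_rater_score, human_score, max_gap=1):
    """Return True when the two scores differ by more than max_gap points."""
    for s in (e_rater_score, human_score):
        if s not in SCALE:
            raise ValueError(f"score {s} is outside the 1-6 scale")
    return abs(e_rater_score - human_score) > max_gap


adjacent_case = needs_second_human(4, 5)    # differ by exactly 1 point
discrepant_case = needs_second_human(3, 5)  # differ by 2 points
```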

E-rater® employs a corpus-based approach to model building, in which
actual essay data are used to examine sample essays. Corpus-based
approaches to building NLP tools usually rely on copy-edited text sources
such as newspapers. However, e-rater®’s feature analysis and model
building require unedited text corpora that represent the particular genre
of first-draft student essays (Burstein, 2003; Burstein, Leacock, &
Swartz, 2001).

The features of e-rater® include a syntactic module, a discourse module, 
and a topical-analysis module. These modules provide outputs for model 
building and scoring. E-rater® has been trained on a set of essays scored 
by at least two human raters on a 6-point holistic scale to build models 
(Burstein, 2003; Burstein & Chodorow, 1999; Burstein et al., 2003; Burstein 
& Marcu, 2000). The origin of the syntactic module is parsing. In order 
to capture syntactic variety in an essay, “a parser identifies syntactic 
structures, such as subjunctive auxiliary verbs and a variety of clausal 
structures, such as complement, infinitive, and subordinate clauses” 
(Burstein, Chodorow, & Leacock, 2003, p. 1). The discourse module uses 
a conceptual framework of conjunctive relations including cue words 
(e.g., using words like “perhaps” or “possibly” to express a belief), terms 
(e.g., using conjuncts such as “in summary” and “in conclusion” for sum- 
marizing), and syntactic structures (e.g., using complement clauses to 
identify the beginning of a new argument) to identify discourse-based
relationships and organization in essays (Burstein, 2003; Burstein &
Chodorow, 1999; Burstein et al., 2003; Burstein & Marcu, 2000; Burstein,
Kukich, Wolff, Lu, & Chodorow, 1998). Finally, the topical-analysis module
identifies vocabulary usage and topical content (Burstein, 2003; Burstein
et al., 2003; Burstein & Marcu, 2000). A good essay, unlike a poor one, is
relevant to the assigned topic. Moreover, the variety and type of
vocabulary used in good essays differ from those of poor essays. The
assumption behind this module is that good essays resemble other good
essays and that poor essays likewise resemble other poor essays (Burstein
& Chodorow, 1999; Burstein et al., 1998). The general procedure for a
vector-space model (Salton, as cited in Burstein & Marcu, 2000), which is
used to capture topic and vocabulary usage (Burstein & Chodorow, 1999;
Burstein et al., 2003; Burstein et al., 1998; Burstein & Marcu, 2000), is
described as follows:

...training essays are converted into vectors of word frequencies, 
and the frequencies are then transformed into word weights. These 
weight vectors populate the training space. To score a test essay, it 
is converted into a weight vector, and a search is conducted to find 
the training vectors most similar to it, as measured by the cosine 
between the test and training vectors. The closest matches among 
the training set are used to assign a score to the test essay (Burstein, 
2003, p. 117). 

Here, a vector can be described as the mathematical representation of
an essay. Word frequencies are obtained by counting the number of times
each word occurs in an essay. Word weights refer to the frequency of a
word divided by the number of words in the essay, and the training space
refers to the entire set of vectors generated from the training essays.
Finally, in this context the cosine measures the similarity between the
test and training vectors: the larger the cosine, the closer the two
vectors. Figure 3 illustrates the transformation of training essays into
vectors of word frequencies and then into weight vectors.
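
The quoted procedure can be traced end to end in a short sketch. It assumes the simple weighting described above (word frequency divided by essay length) and assigns the mean human score of the `k` closest training essays; the function names, the value of `k`, and the toy essays are assumptions, not details of e-rater®.

```python
import math
from collections import Counter


def weight_vector(essay):
    """Word frequencies divided by essay length, stored as word -> weight."""
    words = essay.lower().split()
    freqs = Counter(words)
    return {word: count / len(words) for word, count in freqs.items()}


def cosine(u, v):
    """Cosine between two sparse weight vectors (dicts)."""
    dot = sum(x * v.get(word, 0.0) for word, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def score_essay(test_essay, training, k=1):
    """Mean human score of the k training essays closest to the test essay.

    training is a list of (essay_text, human_score) pairs.
    """
    test_vec = weight_vector(test_essay)
    ranked = sorted(
        training,
        key=lambda item: cosine(test_vec, weight_vector(item[0])),
        reverse=True,
    )
    closest = ranked[:k]
    return sum(s for _, s in closest) / len(closest)


# Toy training space (invented for illustration only).
training = [
    ("the heart pumps blood through the body", 5),
    ("my summer holiday was fun", 2),
]
assigned = score_essay("blood flows from the heart", training)
```

The test essay shares vocabulary only with the first training essay, so that essay is the closest match and its score is the one assigned.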
Figure 3: Transformation of Training Essays into Vectors



To summarize, e-rater uses NLP to identify the features of the
faculty-scored essays in its sample collection and store them, with their
associated weights, in a database. When e-rater evaluates a new essay, it compares
its features to those in the database in order to assign a score. Because 
e-rater is not doing any actual reading, the validity of its scoring depends 
on the scoring of the sample essays from which e-rater’s database is 
created (Educational Testing Service, n.d.). 

Criterion™

Criterion™ is a web-based essay scoring and evaluating system, which 
relies on other ETS technologies called e-rater® and Critique writing anal- 
ysis tools. As discussed in detail above, e-rater is an automated essay 
scoring system. As a writing analysis tool, Critique includes a group of 
programs that identify errors in grammar, usage, and mechanics and 
that recognize discourse elements and elements of undesirable style in 
an essay. Besides providing instant holistic scoring, Criterion™ also gives 
individualized diagnostic feedback based on the types of evaluations that 
teachers give when responding to student writing (Burstein et al., 2003). 
The feedback component of Criterion™ is called the advisory component. The
advisory component supplements the e-rater score but does not determine
it (Burstein, 2003). The feedback types that the advisory component
contains are as follows:

a. The text is too brief to be a complete essay (suggesting that 
student write more). 

b. The essay text does not resemble other essays written about 
the topic (implying that perhaps the essay is off-topic). 

c. The essay response is overly repetitive (suggesting that the 
student use more synonyms) (Burstein, 2003, p. 119). 
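
The three advisories can be approximated with simple text statistics. The sketch below is illustrative only: the thresholds are hypothetical, vocabulary overlap with a supplied topic word set stands in for comparison with other essays on the topic, and a type-token ratio stands in for a repetitiveness measure. None of these choices is taken from Criterion™.

```python
def advisories(essay, topic_vocabulary, min_words=50,
               min_overlap=0.2, min_variety=0.4):
    """Return a list of advisory flags; all thresholds are hypothetical."""
    words = essay.lower().split()
    flags = []
    if len(words) < min_words:
        flags.append("too brief")
    if words:
        distinct = set(words)
        # Share of the essay's distinct words that belong to the topic.
        overlap = len(distinct & topic_vocabulary) / len(distinct)
        if overlap < min_overlap:
            flags.append("possibly off-topic")
        # Type-token ratio: distinct words divided by total words.
        variety = len(distinct) / len(words)
        if variety < min_variety:
            flags.append("overly repetitive")
    return flags


topic = {"heart", "blood", "pumps"}
short_flags = advisories("the heart pumps blood", topic, min_words=10)
odd_flags = advisories("dogs dogs dogs dogs", topic, min_words=3)
```

As in the description above, these flags accompany a score rather than determine it.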
Along with holistic scoring, Criterion™ provides diagnostic feedback on
grammar, usage, and mechanics; style and diction; and organization and
development. Criterion™ covers a number of writing genres, including
persuasive, descriptive, narrative, expository, cause and effect,
comparison and contrast, problem and solution, argumentative, issue,
response to literature, workplace writing, and writing for assessment. It
provides writing topics at various levels, including elementary school
(4th and 5th grades), middle school (6th, 7th, and 8th grades), high
school (9th, 10th, 11th, and 12th grades), college (1st year/placement and
2nd year), upper division or graduate school (Graduate Record
Examination® (GRE)), and non-native speakers of English (Test of English
as a Foreign Language® (TOEFL)). The topics are authentic retired ETS
essay topics obtained from various ETS testing instruments such as NAEP™
(National Assessment of Educational Progress),² the English Placement Test
designed for California State University,³ Praxis™,⁴ and TOEFL®.⁵
Criterion™ is capable of analyzing essays on the topics for which it has
been “trained.” A minimum of 465 essays scored by expert raters is
required to train the system on a topic. However, teachers are not limited
to the topics in the Criterion™ library; they can create and assign their
own topics. While holistic scores cannot be reported for teacher-created
topics, it is possible to obtain feedback on every dimension of writing
(ETS, n.d.).

The electronic portfolio and writer’s handbook features aim to
facilitate the writing process for students. The electronic portfolio
allows students to store their first and subsequent drafts online. The
writer’s handbook, on the other hand, lets students view feedback
definitions, examples of correct and incorrect use, and an explanation of
every error reported. Teachers control several features of Criterion™.
They can manage student access to the program by activating or
deactivating the website or by setting start and finish dates. Teachers
can also control student access to spell check, diagnostic feedback, or
holistic scoring by turning these features on or off. Finally, teachers
have the option to insert their own feedback within the student essay
(ETS, n.d.).

Besides its instructional use in classrooms, Criterion™ can also be used
by schools for remediation and placement purposes. Some schools use
Criterion™ for benchmark testing, and others use it for exit testing. In
exit testing, both Criterion™ and a faculty reader assign a score to the
given essay. If the difference between the two scores is more than one
point, a third rater is included in the scoring process (ETS, n.d.).
IntelliMetric™ and MY Access!®

IntelliMetric™, an AES system developed by Vantage Learning, is
known as the first essay-scoring tool based on artificial intelligence
(AI) (Elliot, 2003; Shermis & Barrera, 2002; Shermis, Raymat, &
Barrera, 2003). Like e-rater®, IntelliMetric™ relies on NLP (see the
section about NLP above for more information). IntelliMetric™ has been
used by the College Board for placement purposes (Myers, 2003). MY
Access!® is known as the instructional application of IntelliMetric™
(Vantage Learning, n.d.). More information about the structure and
functions of IntelliMetric™ and MY Access!® is provided below.

IntelliMetric™ 

Using a blend of artificial intelligence (AI), natural language processing
(NLP), and statistical technologies, IntelliMetric™ is a type of learning
engine that internalizes the “pooled wisdom” of expert human raters
(Elliot, 2003, p. 71). As an advanced AI application for scoring essays,
IntelliMetric™ relies on Vantage Learning’s CogniSearch™ and Quantum
Reasoning™ technologies (Elliot, 2003; Shermis & Barrera, 2002; Shermis
et al., 2003; Vantage Learning, 2001a, 2003c). CogniSearch™ is a system
specifically developed for use with IntelliMetric™ to understand natural
language in support of essay scoring. For instance, it parses the text to
analyze the parts of speech and their syntactic relations with one another.
This process helps IntelliMetric™ examine the essay according to the main
characteristics of standard written English (Vantage Learning, 2003c).
Together, CogniSearch™ and Quantum Reasoning™ allow IntelliMetric™ to
internalize the characteristics associated with each score point in an
essay response and then apply them in subsequent scoring (Elliot, 2003;
Shermis & Barrera, 2002; Shermis et al., 2003; Vantage Learning, 2001a).
This approach is claimed to be consistent with the procedure underlying
holistic scoring (Elliot, 2003). It is also claimed that the scoring system
“learns” the characteristics that human raters are likely to value and
those they find poor (Shermis & Barrera, 2002; Shermis et al., 2003).

IntelliMetric™ needs to be trained on a set of essays with known scores
assigned by human raters. These essays are then used as a foundation to
extract the scoring scale and the wisdom of the human raters (Vantage
Learning, 2001a, 2002, 2003b, 2003c). The system analyzes essays in
multiple steps. First, the system internalizes the known scores in a set of
training essays. In other words, it infers the writing rubric and the essay
features associated with each score. Second, the scoring model is tested
against a smaller set of essays with known scores for validation purposes.
Finally, once the model scores the essays as desired, it is applied to new
essays with unknown scores (Shermis & Barrera, 2002; Vantage Learning,
2000b, 2003c).
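
The validation step can be sketched as a check of model scores against held-out human scores. The text does not say what “as desired” means, so the sketch assumes exact and adjacent (within one point) agreement rates and a hypothetical acceptance threshold; `score_fn` is a placeholder for any trained scoring model.

```python
def agreement(model_scores, human_scores):
    """Return (exact, adjacent) agreement rates between two score lists.

    Exact: identical scores. Adjacent: scores within one point of each other.
    """
    if len(model_scores) != len(human_scores) or not human_scores:
        raise ValueError("score lists must be non-empty and of equal length")
    pairs = list(zip(model_scores, human_scores))
    exact = sum(m == h for m, h in pairs) / len(pairs)
    adjacent = sum(abs(m - h) <= 1 for m, h in pairs) / len(pairs)
    return exact, adjacent


def validate(score_fn, validation_set, min_adjacent=0.9):
    """Score held-out essays with known scores and check agreement.

    validation_set is a list of (essay_text, human_score) pairs;
    min_adjacent is a hypothetical acceptance threshold.
    """
    essays, human = zip(*validation_set)
    exact, adjacent = agreement([score_fn(e) for e in essays], list(human))
    return adjacent >= min_adjacent, exact, adjacent


# Toy scores (invented for illustration only).
rates = agreement([4, 3, 5, 2], [4, 4, 3, 2])
accepted = validate(lambda essay: 4, [("essay one", 4), ("essay two", 5)])
```

Only a model that passes this check would then be applied to new essays with unknown scores.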

IntelliMetric™ evaluates over 300 semantic-, syntactic-, and discourse-
related features in an essay by using AI and NLP technologies (Elliot,
2003; Vantage Learning, 2001a). These text-related features are grouped
into larger categories called Latent Semantic Dimensions (LSD) (Vantage
Learning, 2003a). The LSD features are described in five broad categories.
The first category, focus and unity (focus and coherence), uses the features
that emphasize a single point of view, cohesiveness and consistency in
purpose, and main ideas in an essay. The second category, organization,
analyzes transitional fluency and logic of discourse. Examples include the
introduction and conclusion, coordination and subordination, logical
structure, logical transitions, and the sequence of ideas in an essay. The
third category, development and elaboration, examines the breadth of the
content and the supporting ideas in an essay (e.g., vocabulary,
elaboration, word choice, concepts, and support). The fourth category,
sentence structure, focuses on sentence complexity and variety, such as
syntactic variety, sentence complexity, usage, readability, and
subject-verb agreement. The fifth and final category, mechanics and
conventions, analyzes whether the essay follows the conventions of standard
American English, such as grammar, spelling, capitalization, sentence
completeness, and punctuation (Elliot, 2003; Vantage Learning, 2001a,
2003a). Figure 4 displays the IntelliMetric™ Feature Model.⁶
Figure 4: IntelliMetric™ Feature Model
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There are five key principles underlying the IntelliMetric™ system. 
First, IntelliMetric™ is modeled on the human brain. IntelliMetric™ “emu- 
lates the way in which the human brain acquires, stores, accesses and uses 
information” (Vantage Learning, 2003c, p. 5). Therefore, a neurosynthetic 
( neuro = brain and synthetic = artificially created) approach is used to dupli- 
cate the mental processes employed by the human expert raters. Second, 
IntelliMetric™ is considered to be a learning engine that obtains the 
necessary information by learning ways to examine the sample pre-scored 
essays by expert raters. In other words, by modeling the scoring process 
used by expert human raters, IntelliMetric™ learns the rubric and the 
essential characteristics for scoring an essay as well as the ways those char- 
acteristics are revealed in each score point. Its “error reduction function” 
allows IntelliMetric™ to increase its accuracy over time by detecting and 
“learning from” its mistakes. Third, IntelliMetric™ is systemic and based 
on a complex system of information processing. Fourth, IntelliMetric™ is
inductive. Its judgments are based on inductive
reasoning and it makes inferences about how to analyze an essay based on 
the sample responses previously evaluated by expert human raters. Finally, 
IntelliMetric™ is multidimensional and non-linear. It employs multiple 
judgments that rely on multiple mathematical models. It is claimed that 
while many scoring systems are based on the General Linear Model (GLM), 
IntelliMetric™ uses a nonlinear and multidimensional approach to analyze 
essays. It is also claimed that the writing process is more complex than
the General Linear Model’s simplistic approach, which suggests that an
essay score increases as the values of text features increase and vice versa
(Vantage Learning, 2003c).
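
The linear assumption criticized here can be written compactly. The notation below is introduced only for illustration and is not taken from Vantage Learning's documentation: \hat{y} is the predicted essay score, x_i is the value of the i-th text feature, and \beta_i is the weight attached to that feature.

```latex
\hat{y} = \beta_0 + \sum_{i=1}^{k} \beta_i x_i ,
\qquad
\frac{\partial \hat{y}}{\partial x_i} = \beta_i
```

Under this form, a feature with a positive weight raises the predicted score by the same amount for every unit increase, whatever the values of the other features. A nonlinear, multidimensional model is not bound by that constraint.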

One of the best attributes of IntelliMetric™ is that it is capable of 
evaluating essay responses in multiple languages including English, 
Spanish, Hebrew, Bahasa, Dutch, French, Portuguese, German, Italian, 
Arabic, and Japanese (Elliott, 2003). The system can be applied in
“Instructional” or “Standardized Assessment” modes. The instructional 
mode assists students with revising and editing processes by providing 
holistic and diagnostic feedback on five traits (see MY Access!® section 
below for more information). The Standardized Assessment mode provides 
a holistic score and feedback on various rhetorical and analytical dimen- 
sions of an essay as well as detailed diagnostic feedback on grammar, usage, 
spelling and conventions, if necessary (Elliott, 2003; Vantage Learning, 
2001a, 2003c). 

MY Access!® 

MY Access!® is a web-based writing assessment tool that relies on 
Vantage Learning’s IntelliMetric™ automated essay scoring system. The 
main purpose of the program is to offer students a writing environment 
that provides immediate scoring and diagnostic feedback; that allows 
them to revise their essays accordingly; and that motivates them to 
continue writing on the topic to improve their writing proficiency (Vantage 
Learning, n.d.). 

MY Access!® provides not only immediate diagnostic assessment
of writing but also constructive multilingual feedback for ELL students
in grades K-12. Currently, the system assigns essay topics and provides
feedback in English, Spanish, or Chinese. However, the company plans 
to make this opportunity available for other languages in the future as 
well. Students have two options in using the MY Access!® program. One 
option is writing on a topic assigned in English, Spanish, or Chinese and 
receiving feedback in the same language. Another option is writing an 
essay in English and receiving feedback either in the native language or 
in English. Besides providing multilingual feedback, MY Access!® provides 
multilevel feedback - developing, proficient, and advanced - as well. The 
multilingual dictionary, thesaurus, and translator functions of the pro- 
gram allow students to receive definitions as well as synonyms of a specific 
word (Vantage Learning, n.d.). 

MY Access!® includes several features that aim to make the writing 
process more feasible and effective not only for students, but also for 
teachers. To begin with, the program provides a web-based environment for
the user. Second, MY Access!® relies on the IntelliMetric™ scoring system
and is able to provide instant feedback and scoring for an essay. This
feature not only ensures consistency and accuracy among teachers as well
as schools but also gives teachers more time to focus on instruction. Also,
the analytic scoring and feedback on each of the five categories provide
diagnostic information regarding the student’s writing ability. The program
can provide individualized multilingual feedback (Spanish and Chinese) 
on different genres of writing such as informative, narrative, literary, and 
persuasive essays (Vantage Learning, n.d.). 

MY Access!® contains over 200 operational and pilot prompts that 
generate instant analysis of the essay. These prompts are based on reading 
texts as well as literature at grade levels, and they are available at the
following academic levels: higher education (level 4), high school (level 3),
middle school (level 2), and upper elementary (level 1). Teachers can provide
their own prompts as well, bearing in mind that the system will be unable to
score essays written on those prompts because it first needs to be trained on
about 300 pre-scored essays for a prompt before it can score essays on that
prompt automatically. MY Access!® also
offers a variety of writing tools such as the writing dashboard and my portfolio,
which aim to facilitate the essay writing process for students. The writing 
dashboard feature gives students the opportunity to see their weekly 
progress and the my portfolio feature allows students to view a list of 
completed assignments, scores, reports, comments, and so on (Vantage 
Learning, n.d.). 

Various teacher options allow teachers to have full control of the appli- 
cation of the program. For instance, teachers are able to create groups 
or customize the level as well as the type of feedback according to the 
proficiency level of the students. Moreover, teachers can add their own 
comments on student essays along with the feedback provided by the 
system. The view reports option allows teachers to generate up to ten 
different types of reports on their students’ progress. For instance, the 
student history report provides teachers with not only an analysis of errors 
based on the rule categories in the system, but also with the average 
performance assessments of students over time. Last but not least, the 
MY Access!® website includes parent letters in English, Spanish, and 
Chinese to enable teachers to provide parents an opportunity to get 
involved in their children’s learning process (Vantage Learning, n.d.). 

Bayesian Essay Test Scoring sYstem™ (BETSY)

The final automated essay scoring system to be discussed in this article 
is the Bayesian Essay Test Scoring sYstem or BETSY™, which was developed 
by Lawrence M. Rudner (see endnote 7). BETSY™ is not of the same ilk as the
commercial AES products described above and therefore should be treated more
as a research tool. A more detailed discussion about BETSY™ is presented 
below, following a brief overview of the Bayesian approach to AES. 

Bayesian approach 

Another approach used in AES employs Bayes’ theorem. Bayesian
methods have several applications such as identifying spam and other 
unwanted e-mails based on their similarity with previously classified 
e-mail, and sorting the resumes of job applicants into various job catego- 
ries according to their similarity to previously classified resumes (BETSY, 
n.d.). Several Microsoft products such as Answer Wizard of Office 95®, the 
Office Assistant of Office 97®, and numerous technical troubleshooters are 
other applications of the Bayesian approach (Rudner & Liang, 2002). 

There are two Bayesian models widely used in text classification: the
Multivariate Bernoulli Model and the Multinomial Model. While the former
views each essay as a special case of calibrated features, the latter views
each essay as a sample of calibrated features. In the Bernoulli model, the
conditional probability of the presence of a specific feature is estimated by
the proportion of essays within each category that include the feature.
In the Multinomial model, on the other hand, the probability of each score
for a given essay is computed as the product of the probabilities of the
features included in the essay (BETSY, n.d.; Rudner & Liang, 2002). To
summarize, the Bernoulli model investigates whether a specific feature
exists in an essay or not, whereas the Multinomial model also accounts for
repeated uses of a specific feature in an essay (Rudner & Liang, 2002). The
Bernoulli model computes relatively slowly compared to the Multinomial
model (BETSY, n.d.).
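
The contrast between the two models can be illustrated with a minimal sketch. The Python code below is not BETSY™'s implementation. The toy training essays, the pass/fail labels, the function names, and the add-one smoothing are all assumptions made for illustration. Log probabilities are summed instead of multiplying raw probabilities, which is a common way to avoid numerical underflow.

```python
import math
from collections import Counter

# Hypothetical pre-scored training essays (already tokenized) with categories.
TRAIN = [
    (["clear", "thesis", "strong", "evidence"], "pass"),
    (["strong", "thesis", "evidence", "evidence"], "pass"),
    (["weak", "claim", "no", "evidence"], "fail"),
    (["weak", "weak", "claim"], "fail"),
]
VOCAB = sorted({w for words, _ in TRAIN for w in words})
LABELS = sorted({label for _, label in TRAIN})


def bernoulli_log_scores(essay):
    """Score each category from the presence or absence of every feature."""
    present = set(essay)
    scores = {}
    for label in LABELS:
        docs = [set(words) for words, lab in TRAIN if lab == label]
        log_p = math.log(len(docs) / len(TRAIN))  # category prior
        for w in VOCAB:
            # Smoothed proportion of essays in this category that contain w.
            p = (sum(w in d for d in docs) + 1) / (len(docs) + 2)
            log_p += math.log(p if w in present else 1 - p)
        scores[label] = log_p
    return scores


def multinomial_log_scores(essay):
    """Score each category from every feature occurrence, repeats included."""
    scores = {}
    for label in LABELS:
        counts = Counter(w for words, lab in TRAIN if lab == label for w in words)
        total = sum(counts.values())
        n_docs = sum(lab == label for _, lab in TRAIN)
        log_p = math.log(n_docs / len(TRAIN))  # category prior
        for w in essay:
            if w in VOCAB:  # ignore words never seen in training
                log_p += math.log((counts[w] + 1) / (total + len(VOCAB)))
        scores[label] = log_p
    return scores


def classify(log_scores):
    """Return the category with the highest log score."""
    return max(log_scores, key=log_scores.get)


essay = ["strong", "evidence", "thesis"]
print(classify(bernoulli_log_scores(essay)))
print(classify(multinomial_log_scores(essay)))
```

In this sketch, repeating a word changes the multinomial score but leaves the Bernoulli score untouched. The Bernoulli function also loops over the whole vocabulary for every essay, which is one plausible reason for it being the slower of the two.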

The Bayesian approach includes key concepts such as stemming, stop 
words, and feature selection. Stemming denotes the process of eliminating
suffixes to get stems; for example, “educ” is obtained as the stem for educate,
education, educates, educational, and educated. Stop words refer to various
articles, pronouns, adjectives, and prepositions. Search engines do not list
these types of words because they can cause a large number of irrelevant
results. One approach to feature selection is the reduction in entropy. By
minimizing entropy, it is possible to pick the items with maximum poten- 
tial information gain (Rudner & Liang, 2002). 
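
These three concepts can be sketched in a few lines of Python. Everything below is an illustrative assumption and is not taken from BETSY™: the stop-word list is tiny, the suffix list is a toy rule (not a full stemmer), and the sample documents are invented. Feature selection is shown as information gain, that is, the reduction in category entropy obtained by splitting the essays on the presence of a feature.

```python
import math

# Hypothetical, deliberately tiny resources for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "in", "it", "is", "and", "to"}
SUFFIXES = ("ational", "ation", "ates", "ated", "ate", "ing", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def features(text):
    """Lowercase the text, drop stop words, and stem what remains."""
    return {crude_stem(w) for w in text.lower().split() if w not in STOP_WORDS}


def entropy(labels):
    """Shannon entropy (in bits) of a list of category labels."""
    n = len(labels)
    return -sum(
        (labels.count(c) / n) * math.log2(labels.count(c) / n)
        for c in set(labels)
    )


def information_gain(feature, docs):
    """Entropy reduction from splitting docs on the presence of a feature."""
    labels = [label for _, label in docs]
    with_f = [label for feats, label in docs if feature in feats]
    without_f = [label for feats, label in docs if feature not in feats]
    remainder = sum(
        len(part) / len(docs) * entropy(part)
        for part in (with_f, without_f)
        if part
    )
    return entropy(labels) - remainder


docs = [
    (features("The education of teachers"), "pass"),
    (features("Educated writers"), "pass"),
    (features("Short answers"), "fail"),
    (features("Nothing"), "fail"),
]
for feature in ("educ", "teacher"):
    print(feature, round(information_gain(feature, docs), 3))
```

Here the stem "educ" separates the two categories perfectly and therefore yields a larger gain than "teacher", so it would be the better feature to keep.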

BETSY™

The underlying idea of BETSY™ is the classification of texts based on 
trained materials (Valenti, Neri, & Cucchiarelli, 2003). Rudner and Liang 
(2002) point out that the classification of Bayesian Computer Adaptive 
Testing (CAT) is extended from two categories (master/non-master) to a 
three- or four-point nominal or categorical scale (e.g. extensive, essential, 
partial, unsatisfactory) in BETSY™. While the Bayesian CAT classification 
is based on optimally selected items, BETSY™ uses a large set of items. The 
“items” refer to a large set of essay features. These essay features include
content-related features, such as specific words and phrases and the frequency
of certain content words; form-related features, such as the number of words,
sentence length, number of verbs, and number of commas; and other features,
e.g., the order in which certain concepts are presented and the occurrence of
specific noun-verb pairs (BETSY, n.d.; Rudner & Liang, 2002).

BETSY™ needs to be trained on 1000 texts (Rudner, Garcia, & Welch, 
2005) to learn how to classify new documents based on the following steps: 
train words, evaluate database statistics, eliminate uncommon words, 
determine stop words, train word pairs, evaluate database statistics, elimi- 
nate uncommon word pairs, and perhaps score the training set and trim 
misclassified training texts (BETSY, n.d.). After the training, BETSY™ can 
be applied to a set of trial texts to determine classification accuracy, several 
new texts, or a single text. Essay scoring typically categorizes texts into 
two or more groups such as Pass/Fail and Advanced/ Proficient/Basic/Below 
Basic. Scoring is the major component of BETSY™ and several scoring and 
recalculation options allow users to identify what text to score and how to 
score it. The special options include using Microsoft Notepad® to analyze 
misclassifications, scoring an essay to get diagnostic feedback, and trimming
misclassified training texts (Figure 5) (BETSY, n.d.).

Figure 5: Scoring and recalculation window in BETSY™

It is claimed that BETSY™ includes the best features of PEG™, LSA,
and e-rater® along with its own essential characteristics. In addition, the 
system can be applied to short essays and various content areas. Moreover, 
it is simple to implement and easy to explain to non-statisticians (BETSY, 
n.d.; Rudner & Liang, 2002; Valenti et al., 2003). The research on BETSY™
is limited; however, the software can be downloaded from the official
website for free for research purposes. BETSY™ is in the process of being
converted to Visual Basic and will soon become open source.

Summary and Discussion 

There have been several studies over the past three decades that have 
examined ways to apply technology to writing assessment. More recently, 
increasingly sophisticated computer technology has enabled writing
performance to be assessed using AES technology (Burstein, 2003;
Hamp-Lyons, 2001; Rudner & Gagne, 2001; Rudner & Liang, 2002). As Attali
and Burstein (2006) maintain, AES systems do not directly evaluate the 
intrinsic qualities of an essay as human raters do, but they use correlates
of the intrinsic qualities to predict the score of an essay. The automated
essay systems described in this article employ various techniques to pro- 
vide immediate feedback and scoring. While e-rater® and IntelliMetric™ 
use NLP techniques, IEA™ is based on LSA. Moreover, PEG™ utilizes proxy
measures (proxes) and BETSY™ uses Bayesian methods to assess the quality
of an essay. Unlike PEG™, IEA™, or BETSY™, e-rater® and IntelliMetric™ have
instructional applications (e.g., Criterion™ and MY Access!®), as well.
Finally, each AES system needs different numbers of essays to train the 
system. Table 1 below compares the AES systems. 

Table 1: Comparison of AES Systems

AES System | Developer | Technique | Main Focus | Instructional Application | Number of Essays Required for Training
--------------------------------------------------------------------------------------------------------------------
PEG™ | Page (1966), endnote 8 | Statistical | Style | N/A | 100-400
IEA™ | Landauer, Foltz, & Laham (1997), endnote 9 | LSA | Content | N/A | 100-300
E-rater® | ETS development team (Burstein et al., 1998), endnote 10 | NLP | Style and content | Criterion℠ | 465
IntelliMetric™ | Vantage Learning (Elliot et al., 1998), endnote 11 | NLP | Style and content | MY Access!® | 300
BETSY™ | Rudner (2002), endnote 12 | Bayesian text classification | Style and content | N/A | 1000


One of the main advantages of AES systems is that they can score essays
instantly and provide immediate feedback. Teacher response is necessary 
for a student to improve his/her writing ability. However, for a teacher 
who teaches large classes, this can be quite time-consuming, which could 
possibly affect the frequency of the writing assignments given in class 
(Burstein et al., 2003). Since the appropriateness of feedback has been 
found to be highly individual specific and/or situation specific (Hyland, 
1998), it will be essential to consider an effective method both for
analyzing a large number of essays and for providing
individual feedback. Instructional-based AES systems (e.g., Criterion™
and MY Access!®) attempt to achieve this goal and aim to facilitate
writing evaluation in classrooms. They are designed to supplement
teachers, not to replace them, for example by allowing teachers to include
their own scoring and feedback as well as their own prompts in the program
(ETS, n.d.; Vantage Learning, n.d.).

Computers can also provide opportunities to increase practicality in 
the administration of large-scale writing assessment (Bereiter, 2003). 
Employing human raters to score essays could be quite expensive in terms 
of time and resources. Large-scale standardized writing tests require more
than one rater in order to increase the accuracy in scoring and reduce the
bias the individual scorers might have. Since hiring multiple raters and
training them on a scoring rubric is necessary but costly, including an AES
system in the assessment process could be a cost-effective option (Bereiter,
2003; Chung & O’Neil, 1997; Page, 2003; Sireci & Rizavi, 1999).

Most AES systems tend to focus on product rather than process in 
writing. While the product approach views writing assessment as a summa- 
tive practice, the process approach views it as a formative practice. Drafting 
before submitting the final form of an essay is critical in process writing. 
Instructional-based AES systems (e.g., Criterion™ and MY Access!®) make 
efforts to support formative assessment by allowing students to save their 
first and subsequent drafts on the computer and revise them based on 
the feedback and scoring they receive either from the computer or from 
the teacher. Criterion™ allows teachers to turn off the scoring feature 
so that students continue drafting online. While the previous version of 
MY Access!® (5.0) included online portfolios only, the latest version (6.0) 
provides peer review opportunities and pre-writing activities to promote
process writing. As Shermis and Burstein (2003) pointed out, the
credibility of AES systems will increase when their use moves from a
summative to a more formative assessment.

AES systems are mainly developed for the English language. There
are efforts to enable these programs to assess writing in various languages 
(Shermis & Burstein, 2003). Criterion™, MY Access!® and IntelliMetric™ 
currently include some features for English language learner (ELL) 
students. For instance, Criterion™ includes retired TOEFL (Test of 
English as a Foreign Language) prompts. Criterion™, MY Access!®, and 
IntelliMetric™ contain multilingual feedback capacity (ETS, n.d.; Vantage 
Learning, n.d.). 

As Warschauer and Ware (2006) pointed out, while providing feedback
in a student’s native language is helpful, ELL students might need more 
than translation to improve their writing ability in English. The developers 
of AES systems need to question whether their current practices address 
the needs of ELL students. For instance, it will be important to investigate 
if the style, content, and amount of feedback provided by AES programs 
are appropriate for the ELL population. 

One of the strongest objections to computerized scoring is that 
computers are not capable of assessing an essay as human raters do because 
computers do “what [they are] programmed to do” and do not “appre- 
ciate” an essay (Page, 2003, p. 51). Automated essay scoring systems have 
been criticized for eliminating the human element in writing assessment 
(Warschauer & Ware, 2006) and falling short of human interaction as well 
as the sense of the writer and/or rater as a person (Hamp-Lyons, 2001).
Landauer, Laham, and Foltz (2003) accept that LSA may lack
pertinent background information about the essay writers. However, Page
(2003) argues against these claims by pointing out the high correlations
between PEG™ and expert human raters. That PEG™ obtains high
correlations by looking only at surface features, together with recent
reports on the high correlation of SAT essay scores with essay length,
indicates that modest accuracy may not be that hard to achieve.

Another criticism involves construct objections. Construct objections
question the extent to which computers measure variables that are crit- 
ical in scoring essays (Page, 2003). Both PEG™ and IEA™ have been criti- 
cized for their focus on essay constructs. The main focus of PEG™ is the 
surface features (e.g., word order and essay length) in writing rather than 
the meaning and content (Chung & O’Neil, 1997; Kukich, 2000). While 
IEA™ is superior to other AES systems in terms of assessing the content 
of an essay (Landauer et al., 2003; Rudner & Gagne, 2001), it fails to pro- 
vide information regarding word order (Chung & O’Neil, 1997; Landauer 
et al., 2003). Vantage Learning, PKT, and ETS are currently working on 
adding new and improved features in an effort to increase what are already 
remarkably high accuracy rates. See Table 1 for more information
regarding the main focus of other AES systems.

An important issue with machine scoring is whether the computer 
can be fooled by writers or not (Page, 2003; Powers, Burstein, Chodorow, 
Fowles, & Kukich, 2001; Sireci & Rizavi, 1999). The developers of AES sys- 
tems try to employ algorithms to defend against writers who try to cheat 
the computer. For instance, PEG™ uses an algorithm to flag odd elements
in an essay. When the computer flags an essay, it is set aside for human
evaluation (Page, 2003). Similarly, the Criterion™ and MY Access!® programs
flag anomalous essays for human scoring (ETS, n.d.; Vantage Learning,
n.d.). An earlier version of PEG™ was found to be vulnerable to cheating. 
Since the system mainly employed word count, word length, essay length, 
number of semicolons or commas, and so on (Chung & O’Neil, 1997; Kukich, 
2000; Rudner & Gagne, 2001), it was possible to “trick” the computer by, 
for example, writing longer essays to receive higher scores (Kukich, 2000). 
A study was funded by the GRE (Graduate Record Examinations) Board 
to determine whether e-rater could be tricked into assigning a lower or 
higher value to an essay than it deserved (Powers et al., 2001). The results 
of the study revealed that e-rater might reward a poor essay. The findings
suggested that e-rater was not ready to be used by itself and that it should
be paired with human raters, particularly for high-stakes assessment
purposes (Powers et al., 2001). One can argue whether the scores provided
by one or two human raters are the right criterion. AES systems could
be more accurate, considering that they are based on hundreds of reads of
the same essay and are pooled across multiple raters. In other words,
AES systems are likely to eliminate the between-rater variance.

One of the main characteristics of AES systems is that they need to 
be trained on a large set of pre-scored essay samples in order to be able to 
evaluate the student essays effectively (Burstein, 2003; Chung & O’Neil, 
1997; Elliott, 2003; Landauer et al., 2003; Rudner & Liang, 2002). The
systems can only score the pre-scored prompts from their own libraries. 
Although teachers have the opportunity to assign their own prompts, the 
computer is not capable of scoring those prompts since it is not trained 
to assess the essays with unfamiliar prompts. Thus, the essays written on 
new prompts need to be scored either by the teacher or by an expert human rater.
On the other hand, the AES systems are only as good as what they learn 
from the calibration sample. The calibration samples can be optimized by 
exposing the system to a large number of training essays on a particular 
prompt. For example, an AES system is able to score new NAEP essays if it 
is exposed to a large number of previously scored NAEP essays. 

The AES systems described in this paper are claimed to be accurate 
and valid. For example, Rudner and Liang (2002) reported that
the Bayesian approach achieved accuracy in text categorization
as high as .80. The correlations and agreement
rates between e-rater®, IEA™, IntelliMetric™, or PEG™ and expert
human raters have been found to be high, as well (Attali, 2004; Burstein
& Chodorow, 1999; Landauer et al., 1997; Nichols, 2004; Page, 2003;
Landauer et al., 2003; Vantage Learning, 2000a, 2000b, 2001b, 2002,
2003a, 2003b). While reviewing information regarding agreement, it 
is critical to understand the difference between exact agreement and 
adjacent agreement. Exact agreement requires two or more raters to assign
exactly the same score to an essay (e.g., two raters assign 5 on a 1-6 scoring
scale). On the other hand, adjacent agreement requires two or more raters
to assign scores within one scale point of each other (e.g., one rater assigns
5 and another rater assigns 6 on a 1-6 point scoring scale)
(Cizek & Page, 2003; Elliott, 2003). It is clear that exact agreement is harder
to achieve and that adjacent agreement results in higher agreement rates
(Cizek & Page, 2003). Table 2 compares the agreement rates
across three different constructed-response scoring modes (expert scoring, 
standard human scoring, and IntelliMetric™ scoring) that were used to 
assess the writing responses of eighth-grade students from a statewide 
testing program (Vantage Learning, 2002, p. 4). It shows how the adjacent
agreement rates between humans, and between IntelliMetric™ and humans, can be
higher than the exact agreement rates. The study was conducted by Vantage
Learning using the IntelliMetric™ program. First, two expert raters scored 
each essay response. Then, two traditional human scorers independently 
scored those responses, and finally, IntelliMetric™ scored each response. 
In this study, while “expert” referred to an individual who had a degree
in English as well as at least five years of experience in analyzing writing
in large-scale writing assessment programs, “traditional human scorer” 
was defined as an individual who usually attended a one-day training 
session on writing assessment in a large, statewide scoring session in 
writing (Vantage Learning, 2002). 

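Table 2: Comparison of Expert Scoring, Human Scoring, and IntelliMetric Scoring

Agreement | Human 1 to Human 2 | Human 1 to IntelliMetric | Human 2 to IntelliMetric | Human 1 to Experts | Human 2 to Experts | IntelliMetric to Experts
------------------------------------------------------------------------------------------------------------------------------------------------
Exact | .52 | .53 | .56 | .58 | .54 | .73
Adjacent | .94 | .96 | .95 | .96 | .97 | .99
Discrepant | .06 | .04 | .05 | .04 | .03 | .01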
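
The agreement statistics discussed above are simple to compute from two lists of scores. The following Python sketch is illustrative only: the function name and the sample scores are invented, and adjacent agreement is taken to include exact matches, which is one reading of "within one scale point of each other".

```python
def agreement_rates(scores_a, scores_b):
    """Return (exact, adjacent, discrepant) proportions for two raters.

    Adjacent agreement counts every pair of scores that differ by at most
    one scale point, so exact matches are included in it.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and of equal length")
    diffs = [abs(a - b) for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    exact = sum(d == 0 for d in diffs) / n
    adjacent = sum(d <= 1 for d in diffs) / n
    return exact, adjacent, 1 - adjacent


# Hypothetical scores on a 1-6 scale for eight essays.
human = [4, 5, 3, 6, 2, 4, 5, 1]
machine = [4, 4, 3, 6, 4, 5, 5, 2]
print(agreement_rates(human, machine))  # (0.5, 0.875, 0.125)
```

Under this reading the adjacent and discrepant rates always sum to one, which is why adjacent agreement can never be lower than exact agreement.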


Increasing the reliability of AES systems has always been of great 
interest to AES researchers. The most common way to enhance the 
reliability of an AES system is to calibrate the system with a large number 
of sample essays to make sure that it is well-trained. Another way could be
to examine accuracy as a function of alternative calibration pools. Employing
different training sets will ensure the inclusion of more than one calibra- 
tion pool, which might help better assess the reliability of AES systems. 

MY Access!® and Criterion™ are student-based tools that have emerged
from a computer technology that was originally created to help testing 
organizations score large numbers of essays. Currently, these systems 
are being used in writing classes at various schools and universities as 
writing tools. While AES systems assist teachers in writing classes, they 
are not free of charge. In the future, it would be interesting to see these 
systems as a public utility rather than a proprietary vendor-created-and-owned
system. For instance, the federal government could use NAEP essays
and collect writing samples so that new essay prompts would be available 
for teachers to use in their writing classes. This would allow more teachers 
and students to benefit from the AES systems in writing classrooms. 

The demand for incorporating AES systems in writing assessment 
is increasing. Although some teachers and educators may fear that AES 
technology will eventually replace humans, the producers of classroom-based
AES systems (e.g., MY Access!® and Criterion™) claim that the main
role of these systems is not to replace teachers in writing classes but to 
assist them. An effective way of using AES technology to score essays is to 
incorporate the AES system into the writing evaluation process as a second
or third rater. As Monaghan and Bridgeman (2005) suggested, using an 
AES system as a check point to compare the scores assigned by human 
readers can be an effective way of incorporating the AES technology in 
writing assessment. In other words, the AES systems can be used both 
to verify human scoring and to represent a collection of human judges in 
large-scale writing assessments. 
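
One way to picture the check-point idea is as a simple routing rule. The Python sketch below is a hypothetical illustration and not a procedure reported by Monaghan and Bridgeman (2005): the record layout, the function name, and the one-point tolerance are all assumptions.

```python
def route_scores(records, tolerance=1):
    """Split (essay_id, human, machine) records into accepted and flagged.

    An essay is flagged for an additional human read when the human and
    machine scores differ by more than `tolerance` scale points.
    """
    accepted, flagged = [], []
    for essay_id, human, machine in records:
        target = flagged if abs(human - machine) > tolerance else accepted
        target.append(essay_id)
    return accepted, flagged


# Hypothetical scores on a 1-6 scale.
records = [("e1", 4, 4), ("e2", 5, 3), ("e3", 2, 3), ("e4", 6, 2)]
accepted, flagged = route_scores(records)
print(accepted)  # ['e1', 'e3']
print(flagged)   # ['e2', 'e4']
```

In this arrangement the machine score never replaces the human score. It only decides which essays receive another human read.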

Today AES systems are being widely used as instructional tools in
classrooms (e.g., MY Access!® and Criterion™) and as co-raters in scoring
large-scale standardized writing assessments (e.g., ETS has used e-rater
along with a human rater to score GMAT essays since 1999) without
excluding the human element. AES is a developing technology
(Shermis & Burstein, 2003), and the search for better machine scoring is
ongoing as investigators continue to move forward in their drive to increase
the accuracy and effectiveness of AES systems.

Endnotes

1. See http://www-formal.stanford.edu/jmc/whatisai/whatisai.html 
for more information. 

2. See http://www.ed.gov/programs/naep/index.html for more information. 

3. See www.ets.org/redirect/tests.html for more information. 

4. See www.ets.org/praxis for more information. 

5. See www.ets.org/redirect/tests.html for more information. 

6. The Editors of the JTLA have altered Vantage Learning’s original model in an effort 
to present the information more clearly (see Figure 4). Please refer to Vantage 
Learning (2003a, p. 73) to view the model’s original configuration. 

7. Lawrence Rudner is currently the chief statistician with the 
Graduate Management Admission Council (GMAC).

8. See http://www.pearsonkt.com/papers/IEEEdebate2000.pdf for more information. 

9. See http://pareonline.net/getvn.asp?v=7&n=26 for more information.

10. See http://www.edres.org/betsy/three_prominent.htm for more information. 

11. See http://www.vantage.com/pdfs/intellimetric.pdf for more information. 

12. See http://www.edres.org/betsy/history.htm for more information. 